1.1 - What is RL for SC2?
The documentation so far has focused on building scripted agents. In that paradigm, you, the developer, explicitly write the rules (if/then/else
logic) that determine the bot's behavior. Reinforcement Learning (RL) introduces a fundamentally different approach: instead of writing the rules, you define a goal, and the agent learns its own rules through trial and error.
This section will guide you through building a learned agent, where the "brain" of the bot is a neural network model trained using libraries like stable-baselines3.
Two Paradigms: Scripted vs. Learned
Feature | Scripted Agent (Traditional Bot) | Learned Agent (RL Bot) |
---|---|---|
Developer's Role | Write explicit rules for every scenario. | Design the learning environment, observation, actions, and reward signal. |
Decision Making | Based on a pre-defined logic tree. | Based on a learned policy that maps observations to actions. |
Behavior | Predictable and consistent. | Can be creative and discover non-obvious strategies. |
Primary Challenge | Anticipating all possible game states. | Designing a reward function that leads to desired behavior; long training times. |
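To make the contrast concrete, here is the same decision expressed both ways. This is a minimal sketch: the rule, the observation layout, and the trained `model` are hypothetical placeholders, not code from this project.

```python
import numpy as np

# Scripted: the developer hard-codes the rule.
def scripted_decision(minerals: int, supply_left: int) -> str:
    if minerals >= 50 and supply_left > 0:
        return "build worker"
    return "do nothing"

# Learned: a trained stable-baselines3 model maps an observation
# vector to an action index; the "rule" lives in its weights.
# (Assumes `model` was trained earlier, e.g. with PPO.)
def learned_decision(model, minerals: int, supply_left: int) -> int:
    obs = np.array([minerals, supply_left], dtype=np.float32)
    action, _states = model.predict(obs, deterministic=True)
    return int(action)  # e.g. 0 = "do nothing", 1 = "build worker"
```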
The Core RL Workflow
At its heart, RL is a feedback loop between an Agent and an Environment.
    +------------------+            +------------------+
    |                  |            |                  |
    |      Agent       |   Action   |   Environment    |
    |  (Your RL Model) |----------->|  (StarCraft II)  |
    |                  |            |                  |
    +--------^---------+            +---------v--------+
             |                                |
             |     Observation + Reward       |
             +--------------------------------+
- The Agent observes the state of the Environment.
- Based on the observation, the Agent chooses an Action.
- The Environment updates based on the Action and returns a new Observation and a Reward signal.
- The Agent uses the Reward to update its internal strategy (its Policy) to take better actions in the future.
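In code, this feedback loop is the standard gymnasium interaction pattern. A minimal sketch, assuming a custom environment class named `SC2Env` (hypothetical until we build it later in this section):

```python
env = SC2Env()  # hypothetical: our custom StarCraft II environment

obs, info = env.reset()            # the initial Observation
terminated = truncated = False
while not (terminated or truncated):
    # Choose an Action (random here; a trained Agent would use its Policy).
    action = env.action_space.sample()
    # The Environment advances and returns a new Observation and a Reward.
    obs, reward, terminated, truncated, info = env.step(action)
    # During training, the Agent uses `reward` to improve its Policy.
env.close()
```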
Key Terminology
Term | Definition in our SC2 Context |
---|---|
Agent | The stable-baselines3 model we will train. This is the "brain." |
Environment | The StarCraft II game, wrapped in a custom gymnasium interface. |
Observation | A numerical representation of the game state (e.g., an array with minerals, supply, unit health). |
Action | A discrete choice the Agent makes (e.g., 0 for "do nothing," 1 for "build worker"). |
Reward | A numerical score (+1, -10, etc.) we give the Agent to reinforce good behavior and punish bad behavior. |
Policy | The internal strategy of the Agent; essentially, a function that decides which action to take for a given observation. |
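In gymnasium terms, the Observation and Action rows of this table become the environment's declared spaces. A sketch with an illustrative three-feature observation and two-action layout; the exact features and actions are placeholders, not this project's final design:

```python
import numpy as np
from gymnasium import spaces

# Observation: a fixed-size numeric array, e.g.
# [minerals, supply_left, worker_count], scaled into the 0..1 range.
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

# Action: an integer choice, e.g. 0 = "do nothing", 1 = "build worker".
action_space = spaces.Discrete(2)
```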
Our Goal for This Section
Our project is to build the bridge between python-sc2 and stable-baselines3, then train agents to complete specific tasks:
- Task 1: Create a reusable gymnasium environment for StarCraft II.
- Task 2: Train an agent to perform a simple economic task.
- Task 3: Train an agent to make a simple macroeconomic trade-off.
- Task 4: Train an agent to perform a simple combat micro task.
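Once Task 1 gives us a gymnasium-compatible environment, training follows the standard stable-baselines3 pattern. A sketch, again assuming the hypothetical `SC2Env` class; the algorithm, timestep budget, and filename are illustrative:

```python
from stable_baselines3 import PPO

env = SC2Env()  # hypothetical: the gymnasium environment from Task 1

# PPO is one of several algorithms stable-baselines3 provides;
# "MlpPolicy" means a small fully connected network serves as the Policy.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)  # illustrative training budget
model.save("sc2_agent")               # hypothetical filename
```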
Why is RL Challenging in StarCraft II?
Applying RL to a complex game like StarCraft II is difficult for two main reasons:
- Massive State Space: The number of possible game states is nearly infinite. We must carefully select a small subset of features for our agent to observe (see the sketch after this list).
- Credit Assignment Problem: A game can last for many minutes. If you win, was it because of a good decision you made 30 seconds ago or 10 minutes ago? Designing reward functions that correctly assign credit is a major challenge.
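As an illustration of taming the state space, an observation builder can compress the full game into a handful of normalized numbers. The chosen features and scaling constants below are placeholders, not prescriptions:

```python
import numpy as np

def build_observation(bot) -> np.ndarray:
    """Reduce the full SC2 game state to a few normalized features.

    `bot` is a python-sc2 BotAI instance; the feature list and the
    normalization constants here are illustrative only.
    """
    return np.array(
        [
            min(bot.minerals / 1000.0, 1.0),  # economy: banked minerals
            bot.supply_left / 200.0,          # room to grow
            bot.workers.amount / 100.0,       # current worker count
        ],
        dtype=np.float32,
    )
```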
In the following chapters, we will address these challenges by starting with very simple, focused tasks.